Appendix A — Assignment A

Simple linear regression

Instructions

  1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.

  2. Make R code chunks to insert code and type your answer outside the code chunks. Ensure that the solution is written neatly enough to understand and grade.

  3. Render the file as HTML to submit. For theoretical questions, you can either type the answer and include the solutions in this file, or write the solution on paper, scan and submit separately.

  4. The assignment is worth 100 points, and is due on 7th October 2023 at 11:59 pm.

  5. Five points are properly formatting the assignment. The breakdown is as follows:

  • Must be an HTML file rendered using Quarto (the theory part may be scanned and submitted separately) (2 pts).
  • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.). There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)
  • Final answers of each question are written clearly (1 pt).
  • The proofs are legible, and clearly written with reasoning provided for every step. They are easy to follow and understand (1 pt)

A.1

The first step in using the capital asset pricing model (CAPM) is to estimate the stock’s beta \((\beta)\) using the market model. The market model can be written as:

\(R_{it} = \alpha_i + \beta_iR_{mt} + \epsilon_{it},\)

where \(R_{it}\) is the excess return for security \(i\) at time \(t\), \(R_{mt}\) is the excess return on a proxy for the market portfolio at time \(t\), and \(\epsilon_t\) is an iid random disturbance term. The coefficient beta in this case is the CAPM beta for security \(i\).

Suppose that you had estimated \(\beta\) for a stock as \(\hat{\beta}=1.147\). The standard error associated with this coefficient \(SE(\hat{\beta})\) is estimated to be 0.0548. A city analyst has told you that this security closely follows the market, but that it is no more risky, on average, than the market. This can be tested by the null hypotheses that the value of beta \((\beta)\) is one. The model is estimated over 62 daily observations. Test this hypothesis against a one-sided alternative that the security is more risky than the market \((\beta>1)\). Consider Type 1 error \((\alpha)\) as \(1\%\). Write down the null and alternative hypothesis. What do you conclude?

Does your conclusion change if you consider the Type 1 error \((\alpha)\) as \(0.1\%\)?

(4 + 1 = 5 points)

A.2

When asked to state the simple linear regression model, a students wrote it as follows: \(E(Y_i) = \beta_0 + \beta_1X_1 + \epsilon_i\). Is this correct? Justify your answer.

(2 points)

A.3

Consider the simple linear regression model below:

\(Y_i = \beta_0 + \beta_1X_1 + \epsilon_i, i = 1,...,n\)

where:

\(\beta_0 = 100, \beta_1 = 20,\) and \(\sigma^2 =5\). The following assumptions are made for the model:

A. \(E(\epsilon_i) = 0,\)

B. \(Var(\epsilon_i) = \sigma^2,\)

C. \(Cov(\epsilon_i, \epsilon_j)=0 \ \forall i, j; i\ne j\)

An observation \(Y\) is made for \(X=5\)

Can you state the exact probability that \(Y\) will fall between 195 and 205? If yes, then compute the probability. If not, then state any reasonable assumption(s) you need to make to compute the probability, and then compute the probability.

(1 + 1 + 4 = 6 points)

A.4

The Toluca Company manufactures refrigeration equipment in lots of varying sizes. The dataset toluca.txt consists of of two columns - LotSize and WorkHours required to produce the lot.

When asked for a point estimate of the expected work hours for lot sizes of 30 pieces, a person gave the estimate 202 because that is the mean number of WorkHours for the three observations of LotSize = 30 pieces in the dataset. Is there an issue with this approach? Explain. If there is an issue, then suggest a better approach and use it to estimate the expected work hours for lot sizes of 30 pieces.

(2 + 2 + 4 = 8 points)

A.5

Consider the simple linear regression model below:

\(\log(Y)=\beta_0+\beta_1\log(X)+\epsilon\)

Interpret the coefficient \(\beta_1\), where you mention the approximate expected percentage increase in \(Y\) given an increase of \(1\%\) in \(X\).

Use the approximation: \(\log(1+x) = x\) if \(x<<1\)

(5 points)

A.6

The dataset ACT_GPA consists of the GPA at the end of freshmen year (\(Y\)) that can be predicted from the ACT score (\(X\)) of students of a college.

A.6.1

Obtain the least square estimates of the regression coefficients, and error standard deviation, and state the estimated regression function.

(5 points)

A.6.2

Obtain the maximum likelihood estimate of the error standard deviation. Is it the same as that obtained in the previous question? Why or why not? If it isn’t same, which estimate will you prefer - the MLE or the one obtained in the previous question and why?

(2 + 2 + 4 = 8 points)

A.6.3

Interpret the estimates of the regression coefficients and the error standard deviation as obtained in A.6.1. What is the increase in expected GPA for an increase of 2 points in the ACT score?

(6 + 2 = 8 points)

A.6.4

Does ACT have a statistically significant relationship with the GPA? Justify your answer.

(1 + 2 = 3 points)

A.6.5

Plot the estimated regression function and the data. Does the estimated regression function appear to fit the data well?

(3 + 1 = 4 points)

A.6.6

Include the 95% confidence and prediction intervals in the above plot.

(6 points)

A.6.7

Obtain a point estimate, and the 95% confidence and prediction intervals of the freshman GPA for students with an ACT score of \(30\).

(1 + 2 + 2 = 5 points)

A.6.8

The intercept of the model developed in Q4(a) is the expected GPA when the ACT score is zero. However, the ACT score can never be zero as the minimum possible ACT score is 1 (ref). So, should the intercept be removed from the model? Why or why not?

(2 + 4 = 6 points)

A.7

Consider the regression model in A.3, where the parameters are estimated using Maximum likelihood estimation. Let \(e_i\) denote the \(i^{th}\) residual. Prove that:

  1. \(\sum_{i = 1}^n \hat{Y}_ie_i = 0\)

  2. \(\sum_{i=1}^n Y_i = \sum_{i=1}^n \hat{Y}_i\)

  3. The regression line passes through the point (\(\bar{X}, \bar{Y}\))

(3 + 3 + 3 = 9 points)

A.8

Consider the regression model:

\(Y_i = \beta_0 + \epsilon_i, i = 1,...,n\), where \(\epsilon \sim N(0, \sigma^2)\)

Derive the maximum likelihood estimate of \(\beta_0\), and show whether it is a biased or unbiased estimate of \(\beta_0\).

(3 + 3 = 6 points)

A.9

Consider the regression model:

\(Y_i = \beta_1X_i + \epsilon_i, i = 1,...,n\), where \(\epsilon \sim N(0, \sigma^2)\)

Derive the maximum likelihood estimate of \(\beta_1\), and show whether it is a biased or unbiased estimate of \(\beta_1\).

(4 + 5 = 9 points)